Talk:Paper:Super-scalar RAM-CPU cache compression
= Eyal's questions and comments =
Eyalroz (talk) 16:46, October 7, 2015 (UTC)

== Section 1: Intro ==
* TPC-H high-scorers use numerous disks to increase disk<->memory bandwidth (e.g. dozens for the 100GB SF), but this is costly.

Q: How do they use these disks? Custom RAID arrays? And even so, how does the data reach memory? Is it over PCIe? What does the hardware look like?

* These schemes generalize the already-known DICT, FOR and PS (prefix suppression) schemes.

Q: What does the P stand for, then?

TODO: Read up on those.

== Section 2: Background ==
* The trend of pipeline deepening has actually been halted, i.e. the Intel Core processor families have much shallower pipelines than the Pentium 4. However, pipelining itself is now a feature of essentially any processor (including GPUs, Xeon Phi cores etc.)
* By now almost all processors are multi-issue, i.e. everything is "super-scalar".
* "Predicated" execution of both branches of an if-then-else statement is used a lot in GPUs and other massively-parallel processors, where many "hardware threads" execute in lockstep (common instruction pointer). It is mostly useful when the branches don't involve LOAD, STORE or JMP instructions.
* Is it really that challenging to decompress in a tight loop with independent iterations? If the length of dictionary elements in an LZ77-style compression scheme can be bounded, one could possibly have a decompression loop followed by a data reordering operation to avoid gaps. Of course, this might saturate memory bandwidth faster.

Q: What's more important: avoiding branching, or utilizing memory bandwidth as much as possible?

A: Eyal believes it's the latter, so as CPUs gain instruction issues per cycle, it becomes OK to have small branches; we should still be able to saturate the memory bandwidth with useful requests.

== Section 3: Super-scalar compression ==
Q: Re Figure 2: How is it possible for a compression scheme to have a ratio of 42.8 and still be so much slower in decompression? I mean, essentially it "just writes" data.
It seems like this sluggishness could be optimized away.

Q: Re Figure 2: Are the reference scheme figures (zlib, bzip2, etc.)

Q: Re Figure 3: If the exception section grows backwards, why aren't the values in exception cells 5,4,3,2,1,0?

A: The value at an exception cell is the number of non-exceptions after it, before the next exception.

* If you have predication, a separate patching run may not be faster than handling exceptions inline with predication, because in the second run you're fetching data from memory again (even if only from L2/L3 cache), and only one value per transaction rather than a full cache line, while with predication the value is in registers already.

Q: Doesn't the example of the double-cursor risk cache conflicts between the two cursors?

* The exception-list walk for random access seems a bit expensive, despite the authors' reassurance and the fact that the data is cached.

Q: Why is it a problem to compress floating-point data with PFOR?

== Comments & questions regarding the paper at large ==
Q: What other lightweight compression schemes were known before this article was written ("previous speed-tuned compression algorithms")?

Q: What about fixed-dictionary LZ77-family schemes?

Q: How do you make sure your decompression stays in the cache? i.e. how do you "manage" a cache where evictions and loads happen automatically?

* These compression schemes make assumptions regarding the data. For example, using a reference value assumes the distribution of the data is centered around it. But this assumption, and the extent of its validity, is not discussed in the paper, is it?

Q: How different is DICT/PDICT from an LZ77-family compression, after all?

Q: What about using Huffman coding (with occasional alignment and indicating the length of the next block of Huffman'ed values)? Either independently of other methods or after them?

Q: Martin Kersten says that MonetDB has (on some branch?) column compression, developed for the SkyServer, with more compression schemes.
Are there interesting schemes "missing" from Vectorwise?

Q: Martin Kersten says that MonetDB has (on some branch?) column compression, developed for the SkyServer, with variable-length column chunks. What are the pros and cons of this?

A: Pros: a potential benefit in compression ratio, but how high can it be, really? Cons: oh, lots; lack of uniformity of processed chunks, misalignment of chunks from different columns, probably more branching in the code, etc.

* When this was written, the authors were not focused on execution on the compressed representation.

Q: A storage manager, or column manager, seems to play an important part in a DBMS. But is it always distinct? Or at least, can it always be made a distinct, independent component of the DBMS?

Q: What about non-temporal store commands (MOVNT and such)? Aren't they important for compression and decompression without lookback?
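
Illustration for the Figure 3 answer in Section 3 (why exception cells hold gaps rather than 5,4,3,2,1,0): a minimal Python sketch of a PFOR-style encode plus two-pass decode. This is not the paper's code. The names <code>pfor_encode</code> and <code>pfor_decode</code> are made up here, exceptions are kept in a plain forward list instead of a backwards-growing section, the index of the first exception is assumed to be stored separately, and forcing an extra exception whenever the gap would not fit in the code width is an assumption of this sketch.

```python
def pfor_encode(values, base, bits):
    """Sketch: small codes relative to `base`, plus a list of exceptions."""
    limit = 1 << bits            # regular codes must lie in [0, limit)
    codes, exceptions = [], []
    first_exc = len(values)      # sentinel meaning "no exception"
    last_exc = None              # position of the most recent exception
    for i, v in enumerate(values):
        d = v - base
        # Assumption of this sketch: force an exception when the gap since
        # the previous one would otherwise no longer fit in `bits` bits.
        forced = last_exc is not None and i - last_exc == limit
        if 0 <= d < limit and not forced:
            codes.append(d)
            continue
        codes.append(0)          # gap placeholder, filled in below
        exceptions.append(v)
        if last_exc is None:
            first_exc = i
        else:
            # Exception cell = number of non-exceptions before the next one.
            codes[last_exc] = i - last_exc - 1
        last_exc = i
    if last_exc is not None:     # last exception: its gap runs to the end
        codes[last_exc] = len(values) - last_exc - 1
    return codes, exceptions, first_exc


def pfor_decode(codes, exceptions, base, first_exc):
    """Sketch: decode every slot blindly, then patch the exceptions."""
    n = len(codes)
    # Pass 1: branch-free decode of all slots (exception slots get garbage).
    out = [base + c for c in codes]
    # Pass 2: walk the linked list of exception positions and patch.
    cur, k = first_exc, 0
    while cur < n:
        gap = codes[cur]         # read the gap before overwriting the slot
        out[cur] = exceptions[k]
        k += 1
        cur += gap + 1
    return out


values = [3, 1, 4, 1, 500, 9, 2, 6, 700, 3]
codes, exceptions, first_exc = pfor_encode(values, base=0, bits=4)
# codes == [3, 1, 4, 1, 3, 9, 2, 6, 1, 3]; slots 4 and 8 hold gaps 3 and 1
decoded = pfor_decode(codes, exceptions, 0, first_exc)
```

The second pass touches only the exception slots, which is the separate patching run discussed in the predication bullet of Section 3; a predicated variant would instead select between the decoded code and the exception value inside the first loop.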